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Optimal shrinkage estimation in heteroscedastic 
hierarchical linear models 


S. C. Kou and Justin J. Yang 


Abstract Shrinkage estimators have profound impacts in statistics and in scien¬ 
tific and engineering applications. In this article, we consider shrinkage estimation 
in the presence of linear predictors. We formulate two heteroscedastic hierarchical 
regression models and study optimal shrinkage estimators in each model. A class 
of shrinkage estimators, both parametric and semiparametric, based on unbiased 
risk estimate (URE) is proposed and is shown to be (asymptotically) optimal under 
mean squared error loss in each model. Simulation study is conducted to compare 
the performance of the proposed methods with existing shrinkage estimators. We 
also apply the method to real data and obtain encouraging and interesting results. 


1 Introduction 

Shrinkage estimators, hierarchical models and empirical Bayes methods, dating 
back to the groundbreaking works of ED and EH, have profound impacts in 
statistics and in scientific and engineering applications. They provide effective 
tools to pool information from (scientifically) related populations for simultaneous 
inference—the data on each population alone often do not lead to the most effec¬ 
tive estimation, but by pooling information from the related populations together 
(for example, by shrinking toward their consensus “center”), one could often ob¬ 
tain more accurate estimate for each individual population. Ever since the seminal 
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works of m and uni, an impressive list of articles has been devoted to the study of 
shrinkage estimators in normal models, including ll^[T3l l4ll5ll^[n i^[T5l [8lfmi^. 
among others. 

In this article, we consider shrinkage estimation in the presence of linear predic¬ 
tors. In particular, we study optimal shrinkage estimators for heteroscedastic data 
under linear models. Our study is motivated by three main considerations. First, in 
many practical problems, one often encounters heteroscedastic (unequal variance) 
data; for example, the sample sizes for different groups are not all equal. Second, 
in many statistical applications, in addition to the heteroscedastic response variable, 
one often has predictors. For example, the predictors could represent longitudinal 
patterns EUlEll, exam scores ll20l . characteristics of hospital patients JTtI . etc. 
Third, in applying shrinkage estimators to real data, it is quite natural to ask for the 
optimal way of shrinkage. 

The (risk) optimality is not addressed by the conventional estimators, such as the 
empirical Bayes ones. One might wonder if such an optimal shrinkage estimator 
exists in the first place. We shall see shortly that in fact (asymptotically) optimal 
shrinkage estimators do exist and that the optimal estimators are not empirical Bayes 
ones but are characterized by an unbiased risk estimate (URE). 

The study of optimal shrinkage estimators under the heteroscedastic normal 
model was first considered in ll25l . where the (asymptotic) optimal shrinkage es¬ 
timator was identified for both the parametric and semiparametric cases. Il26l ex¬ 
tends the (asymptotic) optimal shrinkage estimators to exponential families and het¬ 
eroscedastic location-scale families. The current article can be viewed as an exten¬ 
sion of the idea of optimal shrinkage estimators to heteroscedastic linear models. 

We want to emphasize that this article works on a theoretical setting somewhat 
different from f26i but can still cover its main results. Our theoretical results show 
that the optimality of the proposed URE shrinkage estimators does not rely on nor¬ 
mality nor on the tail behavior of the sampling distribution. What we require here are 
the symmetry and the existence of the fourth moment for the standardized variable. 

This article is organized as follows. We first formulate the heteroscedastic linear 
models in Sec. |2] Interestingly, there are two parallel ways to do so, and both are 
natural extensions of the heteroscedastic normal model. After reviewing the con¬ 
ventional empirical Bayes methods, we introduce the construction of our optimal 
shrinkage estimators for heteroscedastic linear models in Sec. [3] The optimal shrink¬ 
age estimators are based on an unbiased risk estimate (URE). We show in Sec.|4] 
that the URE shrinkage estimators are asymptotically optimal in risk. In Sec.|5]we 
extend the shrinkage estimation to a semiparametric family. Simulation studies are 
conducted in Sec.|6] We apply the URE shrinkage estimators in Sec.|7]to the baseball 
data set of El and observe quite interesting and encouraging results. We conclude 
in Sec. [S] with some discussion and extension. The appendix details the proofs and 
derivations for the theoretical results. 
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Fig. 1 Graphical illustration of the two heteroscedastic hierarchical linear models. 


2 Heteroscedastic Hierarchical Linear Models 

Consider the heteroscedastic estimation problem 

Y,\e"t^\yK{9„Ai), i=l,...,p, (1) 

where 6 = {9i,...,0p)^ is the unknown mean vector, which is to be estimated, and 
the variances A, > 0 are unequal, which are assumed to be known. In many sta¬ 
tistical applications, in addition to the heteroscedastic Y — {Y\^...,YpY, one often 
has predictors X. A natural question is to consider a heteroscedastic linear model 
that incorporates these covariates. Notation-wise, let denote the p in¬ 

dependent statistical units, where k, is the response variable of the /-th unit, and 
Xi = (Xi/,... jXiiiY is a k-dimensional column vector that corresponds to the k co¬ 
variates of the ;-th unit. The kx p matrix 

x = [Xi|...|Xp], Xl,..,XpeR^ 

where Xi is the /-th column of X, then contains the covariates for all the units. 
Throughout this article we assume that X has full rank, i.e., rank(X) = k. 

To include the predictors, we note that, interestingly, there are two different ways 
to build up a heteroscedastic hierarchical linear model, which lead to different struc¬ 
ture for shrinkage estimation. 

Model I: Hierarchical linear model. On top of ([T]i, the 0,’s are 0, ,yY (Xf i3, A), 
where and X are both unknown hyper-parameters. Model I has been suggested 
as early as 1221. See ifTSl and iflhl for more discussions. The special case of no 
covariates (i.e., k = 1 and if = [11 • • • 11]) is studied in depth in 1251 . 

Model II: Bayesian linear regression model. Together with ([TJ, one assumes 6 = 
X^P with j5 following a conjugate prior distribution j3 ~ M^(j3o,AW), where 
W is a known kx k positive definite matrix and jSg and X are unknown hyper¬ 
parameters. Model II has been considered in iflAl l^ fTSl among others; it includes 
ridge regression as a special case when jSg = 0^ and W = 7^. 
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Figure[T]illustrates these two hierarchical linear models. Under Model I, the pos¬ 
terior mean of 6 is 9^'^ = X{X U -f A,- (A +Ai)^^ Xj j3 for i = 1,so 

the shrinkage estimation is formed by directly shrinking the raw observation U 
toward a linear combination of the k covariates Xi. If we denote /J,- ~ XjP, and 
-S^row the row space of X, then we can rewrite the posterior 
mean of 9 under Model I as 

= + with/J e jSfrow w. (2) 

Under Model II, the posterior mean of 6 is 

withp^'^°=xw{xw+v)-^p'^^^+v{xw+vy^Po, o) 

where p = (XA X ) XA Y is the weighted least squares estimate of the 
regression coefficient, A is the diagonal matrix A = diag(Ai,...,Ap), and V = 
{XA^^X^y^. Thus, the estimate for 0, is linear in Xi, and the “shrinkage” is 

achieved by shrinking the regression coefficient from the weighted least squares 
iWLS 

estimate p toward the prior coefficient Pq. 

As both Models I and II are natural generalizations of the heteroscedastic nor¬ 
mal model ([I]l, we want to investigate if there is an optimal choice of the hyper¬ 
parameters in each case. Specifically, we want to investigate the best empirical 
choice of the hyper-parameters in each case under the mean squared error loss 


ip{e,e) 


1 

p 


0-0 


2 


-E ( 0,-00 

pti 


(4) 


with the associated risk of 0 defined by 


/?p(0,0) = Ey|e(/p(0,0)), 

where the expectation is taken with respect to Y given 0. 

Remark 1. Even though we start from the Bayesian setting to motivate the form 
of shrinkage estimators, our discussion will be all based on the frequentist setting. 
Hence all probabilities and expectations throughout this article are fixed at the un¬ 
known true 0. 

Remark 2. The diagonal assumption of A is quite important for Model I but not so 
for Model II, as in Model II we can always apply some linear transformations to 
obtain a diagonal covariance matrix. Without loss of generality, we will keep the 
diagonal assumption for A in Model II. 

For the ease of exposition, we will next overview the conventional empirical 
Bayes estimates in a general two-level hierarchical model, which includes both 
Models I and II: 
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y|0~^(0,A)ande~^p(M,B), (5) 

where S is a non-negative definite symmetric matrix that is restricted in an allowable 
set SS, and jU is in the row space ^iow/{X) of X. 

Remark 3. Under Model I, ju and B take the form of /i = and B G ^ = 
{A/p ; X > O}, whereas under Model II, jl and B take the form of fl ~ X ^and 
B G .^ = {^XX^WX : X > 0}. It is interesting to observe that in Model I, B is of full 
rank, while in Model II, B is of rank k. As we shall see, this distinction will have 
interesting theoretical implications for the optimal shrinkage estimators. 

Lemma 1. Under the two-level hierarchical model 0, the posterior distribution is 

e\Y ^jy'p{B{A+B)-^Y+A{A+B)-^H,A{A + B)-^B), 

and the marginal distribution ofY is Y ^ jVp (/i,A + S). 

For given values of B and fl, the posterior mean of the parameter 0 leads to the 
Bayes estimate 

0^-''=B(A+B)-'y+A(A + S)-'/i. (6) 

To use the Bayes estimate in practice, one has to specify the hyper-parameters in 
B and fl. The conventional empirical Bayes method uses the marginal distribution 
of y to estimate the hyper-parameters. For instance, the empirical Bayes maximum 
likelihood estimates (EBMLE) j^ebmle obtained by maximizing 

the marginal likelihood of Y: 

/^ebmle^^ebmleA argmax - (F - ju)^ (A+B)^'(y - ju) -log (det (A+ B)). 
^ ' Bess 

ne^rowix) 


Alternatively, the empirical Bayes method-of-moment estimates (EBMOM) 
and are obtained by solving the following moment equations for B G 

and jU G Jf^ow (X): 


fl=X^ (^X(A-hB)-^X^J ‘x(A + B)^*y, 

B={Y-pi){Y-^f -A. 

If no solutions of B can be found in iXS, we then set _ Qpxp. Adjustment 

for the loss of k degrees of freedom from the estimation of jU might be applicable for 
B = XC [C = Ip for Model I and X^WX for Model II); we can replace the second 
moment equation by 


( p 

\p-k tr(C) 
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.. • • . „ ... • iEBMLE -EBMOM . , 

The corresponding empirical Bayes shrinkage estimator 6 or 0 is then 

r- . • .'EBMLE /,frmIF\ ,#,EBM0M /^FBMOM\ • • in 

formed by plugging (B ,(l ) or (B ,/i ) into equation (|6]l. 


3 URE Estimates 


The formulation of the empirical Bayes estimates raises a natural question: which 

EBMLE EBMOM 

one is preferred 6 or 6 ? More generally, is there an optimal way to 

-EBMLE -EBMOM 

choose the hyper-parameters? It turns out that neither 9 nor 9 is op¬ 

timal. The (asymptotically) optimal estimate, instead of relying on the marginal 
distribution of Y, is characterized by an unbiased risk estimate (URE). The idea of 
forming a shrinkage estimate through URE for heteroscedastic models is first sug¬ 
gested in ll25l . We shall see that in our context of hierarchical linear models (both 
Models I and II) the URE estimators that we are about to introduce have (asymptot¬ 
ically) optimal risk properties. 

The basic idea behind URE estimators is the following. Ideally we want to find 
the hyper-parameters that give the smallest risk. However, since the risk function 
depends on the unknown 9, we cannot directly minimize the risk function in prac¬ 
tice. If we can find a good estimate of the risk function instead, then minimizing this 
proxy of the risk will lead to a competitive estimator. 

To formally introduce the URE estimators, we start from the observation that, 
under the mean squared error loss (|4li, the risk of the Bayes estimator 9 ^ for fixed 
B and fl is 


Bp(0,0®-^) = i A{A+B)-'{^l-9) ^ + -tr(B{A+B)-U{A+B)-'B), 

D D \ / 


(7) 

which can be easily shown using the bias-variance decomposition of the mean 
squared error. As the risk function involves the unknown 9, we cannot directly min¬ 
imize it. However, an unbiased estimate of the risk is available: 


URE(B,/J) 


A(A + B)-'(y-M)|p + ^tr(A-2A(A+B)-*A), 


(8) 


which again can be easily shown using the bias-variance decomposition of the mean 
squared error. Intuitively, if URE (B,jli) is a good approximation of the actual risk, 

then we would expect the estimator obtained by minimizing the URE to have good 

^ URE 

properties. This leads to the URE estimator 9 , defined by 


lURE 




-'Y+A(A+B )-'fl 


URE. 1 , URE 


(9) 


/^ure^^ureA URE(B,JU). 

^ Bern, 


where 
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In the URE estimator (|9]l, and are jointly determined by minimizing 
the URE. When the number of independent statistical units p is small or moderate, 
joint minimization of B and the vector /i, however, may be too ambitious. In this 
setting, it might be beneficial to set jti by a predetermined rule and only optimize 
B, as it might reduce the variability of the resulting estimate. In particular, we can 
consider shrinking toward a generalized least squares (GLS) regression estimate 

{XMX^) ^XMY = Pm,xY, 


where Af is a prespecified symmetric positive definite matrix. This use of gives 

the shrinkage estimate 0 ’ B{A+B)^'Y +A{A , where one only 

needs to determine B. We can construct another URE estimate for this purpose. 

Similar to the previous construction, we note that 6 has risk 


Rp{d,d -^ ) = - A{A+B)-^iIp-PM,x)e 

p 


1 


+ 


^tr (/p - A (A + B) - ‘ {Ip - Pm,x )) A (/p - A (A + B)- ^ (/p 



An unbiased risk estimate of it is 


UREM(B) = i||A(A+Br'(y-A"')|| +-tr (a - 2A (A+B)-'(/p-P m,x)a). 

^ ^ ( 11 ) 


Both (fTol i and (fTTIi can be easily proved by the bias-variance decomposition of mean 
squared error. Minimizing UREm (B) over B gives the URE GLS shrinkage estima¬ 
tor (which shrinks toward jU ): 


er=Ar(^+cyT+^(A+ir)"A‘', m 

where 

b'^^ = argminUREM (B). 

Bern 

Remark 4. When M = Ip, clearly the ordinary least squares regression 

estimate. When M = A^', then p^ = the weighted least squares regression 

estimate. 


4 Theoretical Properties of URE Estimates 

This section is devoted to the risk properties of the URE estimators. Our core theo¬ 
retical result is to show that the risk estimate URE is not only unbiased for the risk 
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but, more importantly, uniformly close to the actual loss. We therefore expect that 
minimizing URE would lead to an estimate with competitive risk properties. 


4.1 Uniform Convergence of URE 


To present our theoretical result, we first define .if to be a subset of .ifow (X^): 

^ = {Me^w(:x'):||Mll<Mpn|y||}, 

where M is a large and fixed constant and tc £ [0,1 /2) is a constant. Next, we intro¬ 
duce the following regularity conditions: 

(A) a2 = O (p); (B) A,02 = O (p); (C) iti Of = O (p); 

(D) p-^XAX'^ ilD\ (E) p-^XX'^ ~>n£>0; 

(E) p-'XA-‘X^ -^nF>0; (G) p-^XA-^X^ nc- 

The theorem below shows that URE (B, /i) not only unbiasedly estimates the risk 
but also is (asymptotically) uniformly close to the actual loss. 


Theorem 1. Assume conditions (A)-(E) for Model I or assume conditions (A) and 
(D)-(G) for Model II. In either case, we have 


sup 

Bess, fieSS 


URE(B,M)-/p(e,0^'‘) 


■Q in U, as p ' 


We want to remark here that the set .if gives the allowable range of /X: the norm 
of g is up to an o(p*/^) multiple of the norm of Y. This choice of .if does not 
lead to any difficulty in practice because, given a large enough constant M, it will 
cover the shrinkage locations of any sensible shrinkage estimator. We note that it is 
possible to define the range of sensible shrinkage locations in other ways (e.g., one 
might want to define it by oo-norm in K.'’), but we find our setting more theoretically 
appealing and easy to work with. In particular, our assumption of the exponent K < 
1 /2 is flexible enough to cover most interesting cases, including the ordinary 

least squares regression estimate, and the weighted least squares regression 

estimate (as in Remark 4) as shown in the following lemma. 

Lemma 2. (i) g (U) Assume (A) and (A') L^Li = O (p) for some 

5 > 0; then g ^ for jc = 4^^ + (4 + 25) * and a large enough M. 

Remark 5. We want to mention here that Theorem[T]in the case of Model I covers 
Theorem 5.1 of ll25l (which is the special case of k = 1 and X = [1111...| 1]) because 
the restriction of |/i| < max |y,| in ll25l is contained in as 


max |y,| = ( max 

l<Kp l<Kp 


yiy/l < (-^^, 2^/2 
/=1 


rll- 
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Furthermore, we do not require the stronger assumption of Y!i=i = 0{p) 

for some 5 > 0 made in ll25l . Note that in this case (^ = 1 and X = [1|1|...|1]) 
we do not even require conditions (D) and (E), as condition (A) directly implies 
tr((XX^) ^XAX^) = (9(1), the result we need in the proof of Theorem [T] for 
Model I. 

Remark 6. In the proof of Theorem[T] the sampling distribution of T is involved only 
through the moment calculations, such as E(tr(yy^ — A — 00 ^)2)andE(||yf).It 
is therefore straightforward to generalize Theorem[T]to the case of 

y, = 0, + v^z,-, 

where Z, follows any distribution with mean 0, variance 1, E (Z?) = 0, and E (Z(*) < 
oo. This is noteworthy as our result also covers that of ll2^ but the methodology we 
employ here does not require to control the tail behavior of Z, as in ||25]|23. 


4.2 Risk Optimality 

In this section, we consider the risk properties of the URE estimators. We will show 
that, under the hierarchical linear models, the URE estimators have (asymptotically) 
optimal risk, whereas it is not necessarily so for other shrinkage estimators such as 
the empirical Bayes ones. 

A direct consequence of the uniform convergence of URE is that the URE esti¬ 
mator has a loss/risk that is asymptotically no larger than that of any other shrinkage 
estimators. Eurthermore, the URE estimator is asymptotically as good as the oracle 
loss estimator. To be precise, let 0 be the oracle loss (OL) estimator defined by 
plugging 

(^OL-oL\ ^ ip(e,e^'^) 

^ ' Bern, /le-Z ^ ^ 

= argmin ||S(A+S)^^y+A(A— 0||^ 

Bern, 

into (|6l). Of course, 0°^ is not really an estimator, since it depends on the unknown 
0 (hence we use the notation 0°^ rather than 0 ). Although not obtainable in 

practice, 0 lays down the theoretical limit that one can ever hope to reach. The 
next theorem shows that the URE estimator 0 is asymptotically as good as the 
oracle loss estimator, and, consequently, it is asymptotically at least as good as any 
other shrinkage estimator. 

Theorem 2. Assume the conditions ofTheorem\J\and that g j§f. Then 


jimP(/;,(0,0^^) >/p(0,0°^)+e) =0 Ve>0, 
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limsup(/?p(0,0'^'^’^) -7;p(0,0°^)) =0. 

Corollary 1. Assume the conditions ofTheorem\I\and that g Then for any 

estimator d =Bp{A + Bp) ^Y+A(A + Bp) ^ ftp with Bp £ and ftp € 
we always have 


jim P (^Ip ( 0 ,0^*^) > Ip (^0, = 0 

iimsup(^Rp(^e,e^^^'^ -Rp(^d,e^‘”^'’^^ <o. 


Ve > 0, 


Corollary [T] tells us that the URE estimator in either Model I or II is asymptoti¬ 
cally optimal: it has (asymptotically) the smallest loss and risk among all shrinkage 
estimators of the form (|6]l. 


4.3 Shrinkage toward the Generalized Least Squares Estimate 


The risk optimality also holds when we consider the URE estimator 0j^ that 
shrinks toward the GLS regression estimate = Pmx^ introduced in Sec. [3] 

Theorem 3. Assume the conditions of Theorem\I] fl^ £ , and 

p-'^xMx'^ p-^xAMx^^a2, p^^xma^mx^ ^ (i3) 


where only the first and third conditions above are assumed for Model I and only 
the first and the second are assumed for Model II. Then we have 


sup 

Bess 


UREm (B) - Ip 



0 in 


as p ^ 


(14) 


^ A, /V _ J _ \ w 

As a corollary, for any estimator 0 = Bp (A +Bp^ Y +A(A +Bp) JU 

with Bp £ we always have 

h;mP(^lp(0,0)^'^’^) >/p(^0,0^'”''''^+e^ =0 Ve>0, 
limsup ^Rp (^ 0 , 0 ],^*^^-Rp ^0,0^'”^ )) “^' 

Remark 7. Eor shrinking toward where M — Ip, we know from Lemma|2]that 
p. is automatically in Jf, so we only need one more condition p^ 'XA X Cii, 
for Model I. Eor shrinking toward , where M = (fTST l is the same as the 
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conditions (E) and (F) of Theorem[T] so additionally we only need to assume (A') 
of Lemma|2and (F) for Model I. 


5 Semiparametric URE Estimators 

- URE 

We have established the (asymptotic) optimality of the URE estimators 6 and 
in the previous section. One limitation of the result is that the class over 
which the URE estimators are optimal is specified by a parametric form: fi = AC 
(0 < A < oo) in equation (|6]l, where C — Ip for Model I and C = X^WX for Model 
II. Aiming to provide a more flexible and, at the same time, efficient estimation pro¬ 
cedure, we consider in this section a class of semiparametric shrinkage estimators. 
Our consideration is inspired by ll25l . 


5.1 Semiparametric URE Estimator under Model I 

To motivate the semiparametric shrinkage estimators, let us first revisit the Bayes 

estimator 6 under Model I, as given in (|2|i. It is seen that the Bayes estimate 
of each mean parameter 0, is obtained by shrinking T,- toward the linear estimate 
and that the amount of shrinkage is governed by A,-, the variance: the 
larger the variance, the stronger is the shrinkage. This feature makes intuitive sense. 

With this observation in mind, we consider the following shrinkage estimators 
under Model I: 


ef= (1 - bi) Yi + biPi, with p e jsfrow (^), 

where b satisfies the monotonic constraint 

MON (A) : bi € [0, 1], bi < bj whenever A,- < Aj. 

MON (A) asks the estimator to shrink more for an observation with a larger variance. 
Since other than this intuitive requirement, we do not post any parametric restriction 
on bi, this class of estimators is semiparametric in nature. 

Following the optimality result for the parametric case, we want to investigate, 

for such a general estimator d ^ with b € MON (A) and p G -Sfrow (X), whether 
there exists an optimal choice of b and p. In fact, we will see shortly that such an 
optimal choice exists, and this asymptotically optimal choice is again characterized 
by an unbiased risk estimate (URE). For a general estimator 6 with fixed b and 
p G Jfrow (X), an unbiased estimate of its risk Rp{B, 6 is 

URE^^ {b,p) = i lldiag (b) {Y - p)f + Itr (A - 2diag (Z,)A), 
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which can be easily seen by taking B = A(diag (b) ' — Ip) in (|8j. Note that we use 

the superscript “SP” (semiparametric) to denote it. Minimizing over b and [l leads 

^ URE 

to the semiparametric URE estimator 9gp , defined by 

iURE , ,^URE., /fURE. ^rjRE 


where 


(lPsr,flT) = argmin URE^^ {b,n). 

^ ' *GMON(Af fieJfmwiX) 


Theorem 4. Assume conditions (A)-(E). Then under Model I we have 


sup 

i)GMON(A). fi&Jf 


VRE^^{b,n) 


.(e, 


b.fi 


0 in U as p ' 


As a corollary, for any estimator 9 = (Ip — diag(bp))Y + dia.g(bp)flp with bp G 

MON (A) and ftp € J§f, we always have 

jimp(^/p( 0 , 05 ^) >/p(^0,0*'"'‘'’^+e^ =0 Ve>0, 

limsup ^Rp ^0,0^^^ —Rp ^0,0*'”^'’^^ < 0. 

The proof is the same as the proofs of Theorem[T]and Corollary [T]for the case of 
Model I except that we replace each term of A,/(A +A,) by bt. 


5.2 Semiparametric URE Estimator Under Model II 

We saw in Sec. |2]that, under Model II, shrinkage is achieved by shrinking the re- 

- WLS 

gression coefficient from the weighted least squares estimate p toward the prior 
coefficient jSg. This suggests us to formulate the semiparametric estimators through 
the regression coefficient. The Bayes estimate of the regression coefficient is 

=XW{XW + V)-^p'^^^+V{XW + V)-^ Po, withV = 

as shown in @. Applying the spectral decomposition on gives 

^ where A = dmg{d\,...,dk) with d\ < ■ ■ ■ < dk- Using 
this decomposition, we can rewrite the regression coefficient as 

If we denote Z = U'^'W^I^X as the transformed covariate matrix, the estimate 
0^ of 0 can be rewritten as 
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0^’^“ =z^ (x+A {Xh+Ay^u^w-^/^Po^. 

Now we see that X {Xy+Ay^ = diag(A/(A +di)) plays the role as the shrink¬ 
age factor. The larger the value of di, the smaller X/{X + dj), i.e., the stronger the 
shrinkage toward Pq. Thus, di can be viewed as the effective “variance” compo¬ 
nent for the /-th regression coefficient (under the transformation). This observation 
motivates us to consider semiparametric shrinkage estimators of the following form 

^b.po ^ ^ 

= ((4 - diag {b))AZA-'Y + diag (b) , (16) 

where b satisfies the following monotonic constraint 

MON (D) : bi € [0,1], bi < bj whenever di < dj. 


This constraint captures the intuition that, the larger the effective variance, the 
stronger is the shrinkage. 

Vox fixed b and jSg, an unbiased estimate of the risk Rp{d,0^ '•^“)is 


URE^^ (Z>,i3o) = ^ ih - diag( 6 )) AZA-'y TZ^diag {b) - Y 

+ -tr(2Z^(4-diag {b))AZ-A ), 

P 

which can be shown using the bias-variance decomposition of the mean squared 
error. Minimizing it gives the URE estimate of (fe, jSg): 



argmin URE^^ (Z>, j3 g), 

i>GMON(D). PoeR’’ 


which upon plugging into (fTbl) yields the semiparametric URE estimator 65 ^^ un¬ 
der Model II. 


Theorem 5. Assume conditions (A), (D)-(G). Then under Model 11 we have 


sup 

i>GMON(D).X^^o6^ 


uRE"^(4j3o)-4(e,e*'^“) 


■Q in y as p ^ ‘ 


As a corollary, for any estimator 6 obtained from ( 1761) with bp £ MON (D) 
andX^^Q £ we always have 


lim P I / 

p^oo 


^( 0,1 


URE 


'SP J ^ I-P 


>4 0,0 


ypfio.p 


-f £ = 0 V£ > 0 , 
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limsup (^Rp (e, -Rp ^ < 0. 

The proof of the theorem is essentially identical to those of Theorem[T]and Corol- 
larydJfor the case of Model II except that we replace each + c/,) by bi. 


6 Simulation Study 


In this section, we conduct simulations to study the performance of the URE es¬ 
timators. For the sake of space, we will focus on Model I. The four URE estima- 
tors are the parametric 0 of equation (|9l), the parametric of equation (fT^ 
that shrinks toward the OLS estimate ju (i.e., the matrix M ~ Ip), the semi- 

• aURE ^ , . . iURE.OLS , , . , 

parametric d^p of equation (115b . and the semiparametnc dgp that shrinks 
toward which is formed similarly to 0j|^ by replacing A,/(A +A,) with a 

sequence b € MON (A). The competitors here are the two empirical Bayes esti- 

lEBMLE , -EBMOM , , . . , ^ . . - JS+ 

mators 0 and 0 , and the positive part James-Stern estimator 0 as 

described in ll^ fThl : 


qJS+ 





p — k — 2 




{Yt-firy. 


~ OR 

As a reference, we also compare these shrinkage estimators with 0 , the para¬ 

metric oracle risk (OR) estimator, defined as plugging X^^Ip and into equation 
(0), where 

argmin Rp(0,0^’'') 

^ ^ 0<A<o=,flG.i?row{Jf) ^ ^ 

and the expression of Rp{d,6^’^) is given in (|7]i with B = A/p. The oracle risk 
~ OR 

estimator 0 cannot be used without the knowledge of 0, but it does provide a 
sensible lower bound of the risk achievable by any shrinkage estimator with the 
given parametric form. 

For each simulation, we draw (A,, 0,) {i = 1,2, ...,p) independently from a dis¬ 
tribution 7r(A,, 0, |X,',/3) and then draw Yt given (A,, 0,). The shrinkage estimators 
are then applied to the generated data. This process is repeated 5000 times. The 
sample size p is chosen to vary from 20 to 500 with an increment of length 20. In 
the simulation, we fix a true but unknown j3 = (—1.5,4,—3)^ andaknown covari¬ 
ates X, whose each element is randomly generated from Unif(—10,10). The risk 
performance of the different shrinkage estimators is given in Figure|2] 

Example 1. The setting in this example is chosen in such a way that it reflects 
grouping in the data: 


{4,=0.1} +0-5 • 1{A,=0.5}; 


A, ~ 0.5 • 1 



Optimal shrinkage estimation in heteroscedastic hierarchical linear models 


15 


0,|A,- ^ (2 • +Xjp,0.5 ^); Yt ^ A^(0,-,A,). 

Here the normality for the sampling distribution of y,’s is asserted. We can see that 
the four URE estimators perform much better than the two empirical Bayes ones 
and the James-Stein estimator. Also notice that both of the two (parametric and 
semiparametric) URE estimators that shrink towards jX is almost as good as the 
other two with general data-driven shrinkage location—largely due to the existence 
of CO variate information. We note that this is quite different from the case of ll25l . 
where without the covariate information the estimator that shrinks toward the grand 
mean of the data performs significantly worse than the URE estimator with general 
data-driven shrinkage location. 

Example 2. In this example, we allow U to depart from the normal distribution 
to illustrate that the performance of those URE estimators does not rely on the nor¬ 
mality assumption: 


A,:-Unif(0.1,l); 0; = A,- 
Yi - Unif(0i - V3A;, 0; -f Vm,). 

As expected, the four URE estimators perform better or at least as good as the 
empirical Bayes estimators. The EBMLE estimator performs the worst due to its 
sensitivity on the normality assumption. We notice that the EBMOM estimator in 
this example has comparable performance with the two parametric URE estimators, 
which makes sense as moment estimates are more robust to the sampling distri¬ 
bution. An interesting feature that we find in this example is that the positive part 
James-Stein estimator can beat the parametric oracle risk estimator and perform bet¬ 
ter than all the other shrinkage estimators for small or moderate p, even though the 
semiparametric URE estimators will eventually surpass the James-Stein estimator, 
as dictated by the asymptotic theory for large p. This feature of the James-Stein 
estimate is again quite different from the non-regression setting discussed in ll25l . 
where the James-Stein estimate performs the worst throughout all of their examples. 
In both of our examples only the semiparametric URE estimators are robust to the 
different levels of heteroscedasticity. 

We can conclude from these two simulation examples that the semiparametric 
URE estimators give competitive performance and are robust to the misspecification 
of the sampling distribution and the different levels of the heteroscedasticity. They 
thus could be useful tools in analyzing large-scale data for applied researchers. 


7 Empirical Analysis 

In this section, we study the baseball data set of El. This data set consists of the 
batting records for all the Major League Baseball players in the 2005 season. As 
in El and ESl . we build a given shrinkage estimator based on the data in the 
first half season and use it to predict the second half season, which can then be 
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W 

ir 


• Parametric Oracle 

James-Stein 

0 Parametric LIRE toward OLS 

EBMOM 

# Parametric LIRE 

□ Semiparametric LIRE toward OLS 

EBMLE 

V Semiparametric LIRE 



Example 1 


Example 2 


■ 1 


V 

1 

- 1.+++++++++++++++++++++++ 

0.35 

_l_ 


- 


d 


o 

CO _ 

1 


o 



).25 

_J_ 

\ ‘OW 

• • 


-1-^^-1-r 


Fig. 2 Comparison of the risks of different shrinkage estimators for the two simulation examples. 


checked against the true record of the second half season. For each player, let the 
number of at-bats be N and the successful number of batting be H, then we have 
Hij ~ Binomial where i = 1,2 is the season indicator and j = I,-■■ ,p is 
the player indicator. We use the following variance-stabilizing transformation El 
before applying the shrinkage estimators 


Yij = arcsin ^ 


Hi, 


1/4 


Nij +1/2^ 


which gives Yij'^N{0j, {^Ntj) '), 0/ = arcsinWe use 

TSE(e) = E(F2;-e,)^-L^ 

i j 

as the error measurement for the prediction El- 


7.1 Shrinkage Estimation with Covariates 

As indicated in EH, there exists a significant positive correlation between the 
player’s batting ability and his total number of at-bats. Intuitively, a better player 
will be called for batting more frequently; thus, the total number of at-bats will 
serve as the main covariate in our analysis. The other covariate in the data set is the 
categorical variable of a player being a pitcher or not. 

Table [U summarizes the result, where the shrinkage estimators are applied three 
times—to all the players, the pitchers only, and the non-pitchers only. We use all the 
covariate information (number of at-bats in the first half season and being a pitcher 
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All Pichers Non-pichers 


p for estimation 
p for validation 

567 

499 

81 

64 

486 

435 

Covariates? 

No 

Yes 

No 

Yes 

No 

Yes 

Naive 

1 

NA 

1 

NA 

1 

NA 

Ordinary least squares (OLS) 

0.852 

0.242 

0.127 

0.115 

0.378 

0.333 

Weighted least squares (WLS) 

1.074 

0.219 

0.127 

0.087 

0.468 

0.290 

Parametric EBMOM 

0.593 

0.194 

0.129 

0.117 

0.387 

0.256 

Parametric EBMLE 

0.902 

0.207 

0.117 

0.096 

0.398 

0.277 

James-Stein 

0.525 

0.184 

0.164 

0.142 

0.359 

0.262 

Parametric URE toward OLS 

0.505 

0.203 

0.123 

0.124 

0.278 

0.300 

Parametric URE toward WLS 

0.629 

0.188 

0.127 

0.112 

0.385 

0.268 

Parametric URE 

0.422 

0.215 

0.123 

0.130 

0.282 

0.310 

Semiparametric URE toward OLS 

0.409 

0.197 

0.081 

0.097 

0.261 

0.299 

Semiparametric URE toward WLS 

0.499 

0.184 

0.098 

0.083 

0.336 

0.256 

Semiparametric URE 

0.419 

0.201 

0.077 

0.126 

0.278 

0.314 


Table 1 Prediction errors of batting averages using different shrinkage estimators. Bold numbers 
highlight the best perfonnance with covariate(s) in each case. 


or not) in the first analysis, whereas in the second and the third analyses we only 
use the number of at-bats as the covariate. The values reported are ratios of the error 
of a given estimator to that of the benchmark naive estimator, which simply uses 
the first half season Yij to predict the second half Y 2 j. Note that in Table [U if no 
covariate is involved (i.e., when X = [11 • • • 11]), the OLS reduces to the grand mean 
of the training data as in ll25l . 


7,2 Discussion of the numerical result 

There are several interesting observations from Table[T] 

(i) A quick glimpse shows that including the covariate information improves the 
performance of essentially all shrinkage estimators. This suggests that in practice 
incorporating good covariates would significantly improve the estimation and pre¬ 
diction. 

(ii) In general, shrinking towards WLS provides much better performance than 
shrinking toward OLS or a general data-driven location. This indicates the impor¬ 
tance of a good choice of the shrinkage location in a practical problem. An improp¬ 
erly chosen shrinkage location might even negatively impact the performance. The 
reason that shrinking towards a general data-driven location is not as good as shrink¬ 
ing toward WLS is probably due to that the sample size is not large enough for the 
asymptotics to take effect. 

(iii) Table[T]also shows the advantage of semiparametric URE estimates. For each 
fixed shrinkage location type (toward OLS, WLS, or general), the semiparametric 
URE estimator performs almost always better than their parametric counterparts. 
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- EBMOM 


Parametric URE toward OLS 

— Semiparametric URE toward OLS 

- - EBMLE 

-e- 

Parametric URE toward WLS 

— Semiparametric URE toward WLS 

• •• • James-Stein 

-Hr- 

Parametric URE 

Semiparametric URE 


Without Covariates With Covariates 



case of all players. 


The only one exception is in the non-pitchers only case with the general data-driven 
location, but even there the performance difference is ignorable. 

(iv) The best performance in all three cases (all the players, the pitchers only, and 
the non-pitchers only) comes from the semiparametric URE estimator that shrinks 
toward WLS. 

(v) The James-Stein estimator with covariates performs quite well except in the 
pitchers only case, which is in sharp contrast with the performance of the James- 
Stein estimator without covariates. This again highlights the importance of covariate 
information. In the pitchers only case, the James-Stein performs the worst no matter 
one includes the covariates or not. This can be attributed to the fact that the covariate 
information (the total number of at-bats) is very weak for the pitchers only case; 
in the case of weak covariate information, how to properly estimate the shrinkage 
factors becomes the dominating issue, and the fact that the James-Stein estimator 
has only one uniform shrinkage factor makes it not competitive. 


7.3 Shrinkage Factors 

Figure|3]shows the shrinkage factors of all the shrinkage estimators with or without 
the CO variates for the all-players case of Table [1] We see that the shrinkage factors 
are all reduced after including the covariates. This makes intuitive sense because 
the shrinkage location now contains the covariate information, and each shrinkage 
estimator uses this information by shrinking more toward it, resulting in smaller 
shrinkage factors. 
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8 Conclusion and Discussion 

Inspired by the idea of unbiased risk estimate (URE) proposed in ||25]| . we extend 
the URE framework to multivariate heteroscedastic linear models, which are more 
realistic in practical applications, especially for regression data that exhibits het- 
eroscedasticity. Several parallel URE shrinkage estimators in the regression case 
are proposed, and these URE shrinkage estimators are all asymptotically optimal in 
risk compared to other shrinkage estimators, including the classical empirical Bayes 
ones. We also propose semiparametric estimators and conduct simulation to assess 
their performance under both normal and non-normal data. Eor data sets that exhibit 
a good linear relationship between the covariates and the response, a semiparamet¬ 
ric URE estimator is expected to provide good estimation result, as we saw in the 
baseball data. It is also worth emphasizing that the risk optimality for the parametric 
and semiparametric URE estimators does not depend on the normality assumption 
of the sampling distribution of E. 

We conclude this article by extending the main results to the case of weighted 
mean squared error loss. 

Weighted mean squared error loss. One might want to consider the more gen¬ 
eral weighted mean squared error as the loss function: 



where ij/j > 0 are known weights such that t//, = p. The framework proposed in 
this article is straightforward to generalize to this case. 

Eor Model II, we only need to study the equivalent problem by the following 
transformation 


E —>■ y^E', 9i —>■ Xi —>■ Ai —>■ y/jA,', 


(17) 


and restate the corresponding regularity conditions in Theorem[T]by the transformed 
data and parameters. We then reduce the weighted mean square error problem back 
to the same setting we study in this article under the classical loss function (|4]i. 

Model I is more sophisticated than Model II to generalize. In addition to the 
transformation in equation ( fTTl i. we also need X — \j//X in every term related to the 
individual unit i. Thus, 


y^e,\x,p,x 


so these transformed parameters are also heteroscedastic in the sense that 

they have different weights, while the setting we study before assumes all the 
weights on the 0, are one. However, if we carefully examine the proof of Theorem 
[T]for the case of Model I, we can see that actually we do not much require the equal 
weights on the 0,’s. What is important in the proof is that the shrinkage factor for 
unit i is always of the form A,/ (A, 4- A), which is invariant under the transformation 
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Aj xj/iAi and X —?> xj/iX. Thus, after reformulating the regularity conditions in The- 
orem[T]by the transformed data and parameters, we can still follow the same proof 
to conclude the risk optimality of URE estimators (parametric or semiparametric) 
even under the consideration of weighted mean squared error loss. 

For completeness, here we state the most general result under the semiparametric 
setting for Model I. Let 

;.URE / /fURE\\„ /fUREN^URE 

^SP.iir — j j ’ 

URE V) = -t ¥. (bf (Yi - liif + (1 - 2bi)A) , 
argmin URE(£>,/X;t/r). 

^ ^ 6GMON(A),jiG.5f„„(Y) 


Theorem 6. Assume the following five conditions (i/-A) YJi=\ = O (p), (V^-B) 

wfAiO^ = 0{p), iw-C) Iti ^ xiffA,X,Xf con¬ 
verges, and (</-E) p —^^y/>0. Then we have 


sup 

beMON{A), fieJfy, 


\jRE{b,n-¥)-ip{e,e‘’''-¥) 


0 in l}, 

p^oo 


where fi € -Sfy if and only if fi € J^^ow and 

1=1 !=1 


for a large and fixed constant M and a fixed exponent KG [0,1 /2). Ai a corollary, 

for any estimator 6 = (Ip — diag(bp))Y + diag(hp)/tp with bp G MON (A) and 

Ap £ "2V> we have 


jimp(^/p(e, 05 p“) >lp(^0,e'’'”^'’^+e^ =0 Ve>0, 
hm^sup (^Rp (e,e^p^) -Rp (^e,^ < o. 


Appendix: Proofs and Derivations 


Proof of LemmalU We can write 6 = p+Zi andT 

[Y 

and Z 2 ~ M^(0,A) are independent. Jointly I 


= 0 +Z 2 , whereZi ~M^(0,S) 
is still multivariate normal with 
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mean vector ( ^ ) and covariance matrix 

vM, 


A+BB 
B B 


. The result follows immedi¬ 


ately from the conditional distribution of a multivariate normal distribution. 

Proof of Theorem [IJ We start from decomposing the difference between the 
URE and the actual loss as 


= URE (B,Op) - Ip (e, - -tr (a {A+B)-^h {Y - 

= ltr(: 
p V 


YY^ -A-dd‘ ]-^ir(B{A+By^ (yY^-Y d'^-A'^'^ 


-^tr(A(A+B)-V(y-0)^) 

= (I) + (II) + (III). 

To verify the first equality (fTSl l. note that 
URE(B,/i)-URE(B,Op) 

= i||A(A + B)-‘(y-Ai)|p-^||A(A + B)-‘F 

= --trU^(A(A+B)-^)^A(A+B)-'(2y-^) 


(18) 


(19) 


= i||(/p-A(A+B)-^)y+A(A+B)-'^-0|p-i||(/p-A(A+B)-^)y-0 
= ^tr(^/x^(A(A + B)-‘)^(2((/p-A(A+B)-')y-0)+A(A+B)-^M)) ■ 

(fTSl) then follows by rearranging the terms. To verify the second equality ( fT9] l. note 

URE(B,Op)-lp(0,0^-°'') 

1 I 2 1 / ,x 2 

= - A(A+B)-'y -- (/„-A(A+B)-My-0 

P P ^ 2 

tr(A-2A(A+B)^‘A) 

-trny-2(/p-A(A+B)-^)y + 0)^(y-0)j +-tr(A-2A(A + B)-‘A 


= ^tr(yy2'-A-00^) -^tr(B(A+B)-^ (y(y-0)^-A)) . 

With the decomposition, we want to prove separately the uniform L* convergence 
of the three terms (I), (II), and (III). 


























22 


S. C. Kou and Justin J. Yang 


Proof for the case of Model I. 

The uniform L? convergence of (I) and (II) has been shown in Theorem 3.1 of 
USD under our assumptions (A) and (B), so we focus on (III), i.e., we want to show 
that sup I (III) I —s> 0 in L* as p —y 

Without loss of generality, let us assume A j < A 2 < • • • < Ap. We have 




sup 1(111)1 = — sup 

0<X<oo.ixe^ P 0<l<o=.tie^ 


< — sup sup 

P ne^0<ci< -<cp<\ 

where the last equality follows from Lemma 2.1 of ifT^ . For a generic p-dimensional 
vector V, we denote \y\j-,p = (0,.. .0, V;, vy+i,..., Vp). Let Px = be 

the projection matrix onto .5frow (-X^)- Then since ^ C .5frow (-S^), we have 


p 

2 

P 

£QM,(y,-0,) 

= — sup max 

Y.P^{Yi-ei) 

1=1 

P tie^^^J^P 

i=j 


— sup max 


tpiiYi-e.) 


l=J 

Ti 


= — max sup \u'^\Y — 9]j-„\ 

P^<j<P,iey ' 


2 2 

= - max sup \lx'^Px[Y -d]j:p\ < - max sup ||ju|| x \\Px[Y - e]j-.p\ 
P^<j<Ptie^ P^<j<PtieJf 

= - max Mp'^\\Y\\ x ||Px[l'- e];:p|| ■ 
p i<]<p 

Cauchy-Schwarz inequality thus gives 


E sup 1(111)1 <2Mp 

\0<X<oo,fie^ J 


"-yE(||yf)xyt( 


max \\Px[Y - d]j:p\ 


( 20 ) 


It is straightforward to see that, by conditions (A) and (C), 

^E[\\Yf) = = \/LL = o{p^/^) . 

For the second term on the right hand side of (l20l i. let Px = PDF^ denote the 
spectral decomposition. Clearly, 


D = diag 


/ 

1^, 0^ 

y k copie.s p—k copie.s j 


It follows that 


E (rnnxJ\Px[Y - e]j:p\n = E f mt«^[y - e]lpPx[Y - d 


J-P 
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= E m^x^tr [or^lY - 6],., {r^[Y - 0];:^)^)) = E [r^lY - ? 


= ® I im 


< E £ max £ [r^] {Y,„ - 0„) = £ E max £ [F^] (F,„ - 0, 


^<i<p 


\m=j 


l=l \ ^-j-P 


^m=J 


t(0 _ 

~ ^m=p-j+l Urn 
LP maximum inequality, 


For each I, Mf' = YJ,'n=n- /+i 1-^^]; ~ ^m) forms a martingale, so by Doob’s 


E (mf) j < 4E (<^) = 4E (^ £ [F^ (y„, - 0„,) 

= 4E[rnlA.=4[F"AF]„. 


m=l 


Therefore, 


Ef maxJ|Px[F-0];:p|n < I:4[F^AF] 


i=\ 


= 4 E [D]u [r^AF] „ = 4 tr (DF^AF) = 4 tr {P^A) 

/=i 

= 4tr(x^(X’X^)^AA) =4tr((XX’^)^AAX^) =0(1), 

where the last equality uses conditions (D) and (E). We finally obtain 

e( sup 1(111)1 ) xdfp'/z') X 0(1) =o(l). 

\o<X<oo. neJP J \ J \ J 


Proof for the case of Model II. 

Under Model II, we know that 

p 

Y^Aidf = e^Ad = p^(XAX^)p =0{p) 

i=l 

by condition (D). In other words, condition (D) implies condition (B). Therefore, 
we know that the term (I) -^O inl? as shown in Theorem 3.1 of ll25l . and we only 
need to show the uniform L* convergence of the other two terms, (II) and (III). 
Recall that B d = {^XX^WX : A > 0} has only rank k under Model II. We 

can reexpress (II) and (III) in terms of low rank matrices. Let V = {XA^^X^^ *. 
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Woodbury formula gives 

(A + fi)^' = {A + XX'^WXY^ =A-^-A-^XX'^ 

= A-^-A-^XX'^W{XW+ VY^VXA-\ 


which tells us 

B{A+B)-^ =Ip-A{A+B)-^ =XX'^W{XW + Vy^VXA-\ 

Let UAU^ be the spectral decomposition of i.e., = 

UAU^, where A = dmg{di, ...,dk) with d\ < ■■■ < dk- Then (AW + V) ^ = 

(^Ut + W-^I^VW-^I^Y^W-^l^ = W-^/^V{Xh+A)-^U'^W-^l^, from 
which we obtain 

B (A +B)-^ = XX'^W (AW + VXA ^ = XX'^W^'^U {Xh+Ay^AU'^W^I^XA 

If we denote Z = I/^W'/^X, i.e., Z is the transformed covariate matrix, then 
B(A+B)^^ = AZ^ {Xlk+Ay'^AZA -\It follows that 

(II) = -^tr(B(A+fi)^‘ (yr^-ye^-A)) 

= -^tr(Az^ {Xh+A)-^AZA-^ (yy^-ye^-A^) 

= -^tr (a {Xh+Ay^ aza-^ (yy^ - ve ^- a)z^) , 

(III) = -^tr(A(A+B)-V(l'-e)^) 

= -^tr((/p-AZ^ (A4+A)^^AZA-')/j(y-e)^) 

= -^tr(/x (y-0)^) +^tr(A [Xh+Ay^AZA-^n{Y-eYz'^^ 

= (III)i +(111)2. 


We will next show that (II), (III)^, and ( 111)2 all uniformly converge to zero in L*, 


which will then complete our proof. 

Let S = ZA-^ (yy^ - YB^ -A) Z^. Then 


sup I (II) I 

0<A<oo 


2 

— sup 

P 0<A<oo 


k 


L 


Xdi 
X +di 



2 

< — sup 

P 0<Ci<"-<Cj^<dj^ 


k 


Eq[S] 


a 


2 

— max 
P l<j<k 
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where the last equality follows as in Lemma 2.1 of IfTSl . As there are finite number 
of terms in the summation and the maximization, it suffices to show that 

dk //? —> 0 in for all 1 < i <k. 

To establish this, we note that [S],,- = 'Lm=\ (^n %t - Om) - 5„m) [Z];„ [Z]-^, 

= L IE {Y,„ - e,„) - 5n,„) {A-,% {Ym' - Om') ” 5„w)) 

n.m,n' 

^ [^]m [^]im [^]in' ' 

Depending on n,m,n',m' taking the same or distinct values, we can break the sum¬ 
mation into 15 disjoint cases: 

L + L + L + L 

all distinct three distinct, n=m three distinct, n=n' three distinct, n=m' 

+ L + L + L+ L 

three distinct, m=n' three distinct, m=m' three distinct, n'=m' two distinct, n=m, n'=m' 

+ L + L+ L + L 

two distinct, n=n', m=m' two distinct, n=m', n'=m two distinct, n=m=n' two distinct, n=m=m' 

+ L + L + L • 

two distinct, n=n'=m' two distinct, m=n'=m' n=m=n'=m' 


Many terms are zero. Straightforward evaluation of each summation gives 

E ([S]-) = t ^{{^n'Yn {Yn - B„) - l)') [Z]t 

n=l 

+ E E E ((A-'F„ (y„, - d,n)f) lZ]f„ [Z]l 

n=\m^n 

+ E E E ((A,7'y„ {Yn, - d,„)) {A-Xr {Yn - 9n))) [Z]l [Z]l 

n=\m^n 


-2 E E E ((A7'T„ {Yn - On) - 1) (A^'y. {Yn - On))) [Z]l [Z],„ 

n=\m^n 

E E E((Ar„'y„,(y„-0„))(A7/y„,(y„-0„)))[z]2jz],„,[Z],„, 

n= 1 m^n' X 

P 


j, 2A^ pj, A„A.,+A,ei ^ 

jj—l -- 1 - 


n=lm^n 


P 

E 

n=\m^n 


+2EE^|ziUzi 

n=\m^n 


y y 

A A I 

n=\m^n'Xj^n,m^n ” 
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An 


= L T^[Z]UZ]i 


n,m=\ ^ 


P 

L 

^,m=l 


\z]i [Zl; 


p 

L 


^n' 

I ^m^n' 


'zil fzi 


Using matrix notation, we can reexpress the above equation as 
e([S] 2) = [ZAZn,[ZA-'zn,+ [ZZ"]J.+ [ZAZn,[ZA-'0]J 

< tr (ZAZ^) tr (ZA^'Z^) + tr (ZZ^) V tr (ZAZ^) tr (e^A-'Z^ZA^* e) 
= tr(WXAX^)tr(WXA-^Z^)+tr(WXZ^)^ 

+ tr {WXAX'^) tr (j3^ (ZA-'X^) W (ZA-^Z^) jS) , 

which is O {p) 0{p) + 0 (p)^ + 0{p)0 (p^) = O (p^) by conditions (D)-(F). Note 
also that condition (F) implies 





= tr(W-‘V) =tr(W-^(ZA-^Z^)-') =C>(p-'). 


Therefore, we have 

e(4[S]- /p^) =0{p-^)0{p^)/p^ = 0{p-^) ^0, 


which proves 

sup 1(11)1—J 0 in L^, as p —>■ oo. 

0 <A<co 

To prove the uniform convergence of (III)j to zero in L*, we note that 


sup |(iii)ii = - sup |/i^(y-e)| = - sup \fi'^Px{Y-e)\ 
lie^ P iieJf P iie^ 

< - sup IlMlI X ||P;f (T- 0)11 = -Mp^rn x ||Px(y- 0)|1, 

P 


so by Cauchy-Schwarz inequality 

e|^™p|(III)i|^ <2Mp’^-^^E(\\Yf)^E{\\PxiY-e)f). (21) 

Under Model II, 0 = Z^jS, so it follows that 0? = ||0f = tr (j3j3^ZZ^) = 

O (p) by condition (E). Hence ^E ^||F|p^ = y^LjLi (0,^ + 2 I 1 ) = (9 (p'/^). For the 
second term on the right hand side of (l2TI) . note that 

E(||Pi:(y- 0 )|| 2 ) =E(tr(Pj:(l'- 0 )(K- 0 )^)) 
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= tr(PxA)=tr((XX^) ‘XAX^)=C>(1) 

by conditions (D) and (E). Thus, in aggregate, we have 


E 


sup |(III)i| j < 2 Mp*^-^o(p^l^^O{\) = o{l). 


\tie^ 

We finally consider the (III) 2 term. We have 


sup 1(111)2! = ~ 
Q<X<o=.tie^ P ne^0<X<o- 


Xdi 


^ A + r/; L 


ZA-^tl{Y -efz^ 


l=J 


< — sup max 

<^supV \zA-^niY-efz^' 
P iieJfi=i L 

= — sup £|[ZA-V]i[Z(T-0)],-| 
P ne^ti 


£4 ZA-^H{Y-efZ^ 


24 

<- sup , 

P ne^\t 




/=1 


M 


I [zir-»)]?. 

1=1 


Thus, by Cauchy-Schwarz inequality 


E 


sup 

\0<X<^, 



24 


sup £ [ZA 'n] 


X 



Note that 

sup £ [ZA-'m]? = sup £ (" £ [ZA- 1 ] [^]\ 

M^-=^ 1=1 fXGJf i=l £m=l / 

< sup £ f £ [ZA-'];,„ X £ [nt] = sup £ ([ZA-^Z^]Jnf) 
1=1 \m=l m=l / /=1 

= tr(ZA- 2 z^) sup ll/xf = tr(WZA- 2 x^)(Mp'^||y||)" = o(p 2 )||yf, 


where the last equality uses condition (G). Thus, 
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Also note that 

=E(tr(z^z(y-0)(y-0)^)) 

= tr (Z^ZA) = tr {WXAX'^) = O {p) 

by condition (D). Recall that di^ = O {p^^) by condition (F). It follows that 


E 


sup 1 ( 111 ) 2 ! <-0(p ^)o(p^^Ao(p^^A =o{l), 

iO<A<oo,^g^ J P \ \ / 


which completes our proof. 

Proof of LemmaS The fact that s .if is trivial as 


M 


OLS 


= X‘ {XX‘) ^XY = PxY, 


while the projection matrix Px has induced matrix 2-norm ||Px|l 2 = 1- Thus, 




OLS 


<||Px|l2l|T|| = ||y||.ForA 


WLS 


note that 


=X'^ {XA-^X'^) ^XA ^Y 

= A'/2 (XA-^I^'^A-^I^Y 

= A'/2(p^^_./,)A-'%, 


where Pxa ordinary projection matrix onto the row space of XA and 

has induced matrix 2-norm 1. It follows 




WLS 


< 


A'/2 


A-d-xWl 




max A^^ X max A,. x jjy jj . 

\<i<p ' \<i<p ' 


Condition (A) gives 


..1/2 

max A- = 

l<i<p ‘ 


(max4)'/-<(X;4)'/‘ = 0(;,'/4L 

l<i<p “ V / 


Similarly, condition (A') gives 

max A7'/2 = ( ^^ax A72-5)l/(4+25) < (f ^-2-5)l/(4+25) ^ ^ (^l/(4+25)\ _ 
l<i<p ^<i<p ^ ^ 

We then have proved that 


M 


WLS 


< O 


(p'/ 4 )o(pl/( 4 + 25 ))||y||^^(^.)||y|| 
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Proof of Theorem 12 To prove the first assertion, note that 
URE < URE 

by the definition of and so Theorem[T]implies that 


ipie,e 


-In 


<ip 


(e,0-) 

( 0 , 0 '^'^'^) - URE+URe(b°‘^,/1 


-I, 




< 2 sup 
BeSg, /iGif 


URE(B,jU) —/p ^0,0^’^j —7> 0 in and in probability, (22) 


where the second inequality uses the condition that g . Thus, for any e > 0, 


>(/p(0,0''''‘') >lp(0,0°'^)+e) 




2 sup 

i Be^, fieJC 


URE(B,/x)-/p(0,0^''') 


> e 


0 . 


To prove the second assertion, note that 

/p(0,0°^)<lp(0,0"'") 

by the definition of 0°^ and the condition g Thus, taking expectations on 
equation (l22l) easily gives the second assertion. 

Proof of Corollary[TJ Simply note that 


I 


p 




by the definition of 0°^. Thus, 

,,(e,e““) -( 9 ,-i,( 9 ,9°'^). 

Then Theorem|2]clearly implies the desired result. 

Proof of Theorem 12 We observe that 

URE^ {B)-lp(^e, = URE (B,/i") -Ip (^0, {A+B)-^Pm.xA 


Since 


sup 


< sup 

URE(B,M)-/pf0,0^''') 

Bern 

^ ^ V / 

Bess, 

\ / 










30 


S. C. Kou and Justin J. Yang 


by Theorem[Tl we only need to show that 


sup 

Bern 


-tr^A(A + B) ^PmxA 


—?► 0 as p 


Under Model I, 


tr (A + B) ' Pm,xA^ = ^ [PmxM a 

(P A- ” V 

\ 1/2 

[PM,X^fii\ 

< p'y^tr {PmxA{Pm,xAY): for all a >0, 

but tr {PM,xAAPlf x) = tr (x'^ {XMX'^y^XMA^MX'^ {XMX'^y'x'^ 

= tT(^{XMX^y^ (XMA^MX^) {XMX^y^ = 0(1) by (O and condition 

(E). Therefore, 


<{pxY 


sup 


-tr ^A (A + B) ^ Pmj(A 


= 0(1) = 0(p^'/^) ^ 0. 


Under Model 11, A (A + B) ^ ^ (14 + A) ^ ‘ AZA^ \ where 

^-i/2y^-i/2 ^ uaU^,A = diag [di,...,dk) withc/i < • • • <4, andZ = U'^W^^^X 
as defined in the proof of Theorem[T] Thus, 

tr(A(A+B)-'BM,xA) =tr(i>M.YA)-tr(lZ^(A4+A)-'AZA-'BM.xA) . 

We know that tr(PM,xA) = tr (^{XMX^) * {XMAX^)j = 0(1) by the assumption 

O. tr (iZ^ (14 +A)^^AZA^'Pm,xA) =tr(^A (14+A)^^ AZA^^Pm,x>4Z^) 

= tr(^l(14+A)^'AZA^'j!f^(XMl'^)^‘zMAZ^y The Cauchy-Schwarz in¬ 
equality for matrix trace gives 

tr ( (l (14 + A )^ ‘ A) (ZA- {XMX^y ^ XMAZ^^ ) 

<tr'/2((l (14+A)-^A)2) 

X tr'/2 (ZA-'^X^ {XMXy ^XMAZ^ZAMX^ {XMX'^Y^XA-^Z'^'^ . 


Since 
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tr (A/i:+A) ^ i^ x+d^ <kdl = 0[p forallA>0 

as shown in the proof of Theorem[T]and 

tr (ZA-^X'^ {XMX^y^XMAZ^ZAMX^ {XMX^y^XA-^Z^'j 
= \i(^{XMXy^^XMAZ'^ZAMX'^ {XMX'^y^XA-^Z'^ZA-^X'^'j 
= tr((ZMX^)^‘ {XMAX^)W{XAMX^) {XMX'^y^ (XA-'X^)W(XA-^Z^)) 

= 0{p^) 

from ( fOl ) and condition (F), we have 

sup -tr(A(A+B)^'PM.xA 

BeSg P ^ 

This completes our proof of (fl4li . With this established, the rest of the proof is 
identical to that of Theorem|2]and Corollary [T] 


= ^\ 0{l) + \jo{p 2) X(9(p2)j =(9(p ')->0. 
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